Learn why linear regression is a useful baseline approach but is often insufficiently flexible for more complex analyses, how to smooth noisy data, and how to use matrices for machine learning.
After completing this section, you will be able to:
Use linear regression for prediction as a baseline approach.
Use logistic regression for categorical data.
Detect trends in noisy data using smoothing (also known as curve fitting or low-pass filtering).
Convert predictors to matrices and outcomes to vectors when all predictors are numeric (or can be converted to numerics in a meaningful way).
Perform basic matrix algebra calculations.
This section has three parts: linear regression for prediction, smoothing, and working with matrices.
#install.packages("HistData")
library(HistData)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
galton_heights=GaltonFamilies%>%
filter(childNum==1 & gender=="male")%>%
select(father,childHeight)%>%
rename(son=childHeight)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
y=galton_heights$son
test_index=createDataPartition(y,times = 1,p=0.5,list=F)
train_set=galton_heights%>%slice(-test_index)
test_set=galton_heights%>%slice(test_index)
avg=mean(train_set$son)
avg
## [1] 70.43409
mean((avg-test_set$son)^2) #squared error loss (MSE)
## [1] 5.999804
The conditional expectation is equivalent to the regression line:
f(x) = E(Y | X = x) = β0 + β1x
fit=lm(son~father,data=train_set)
fit$coef
## (Intercept) father
## 36.421303 0.493077
y_hat=fit$coef[1]+fit$coef[2]*test_set$father
mean((y_hat-test_set$son)^2)
## [1] 4.791676
y_hat=predict(fit,test_set)
mean((y_hat-test_set$son)^2)
## [1] 4.791676
# ?predict.lm
# ?predict.glm
Q1
Create a data set using the following code:
set.seed(1)
n <- 100
Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) %>%
data.frame() %>% setNames(c("x", "y"))
We will build 100 linear models using the data above and calculate the mean and standard deviation of the combined models. First, set the seed to 1. Within a replicate loop,
(1) partition the dataset into test and training sets of equal size using dat$y to generate your indices,
(2) train a linear model predicting y from x,
(3) generate predictions on the test set, and
(4) calculate the RMSE of that model. Then, report the mean and standard deviation of the RMSEs from all 100 models.
Mean:
#Ans
set.seed(1)
rmse <- replicate(100, {
  test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE) #(1)
  train_set <- dat %>% slice(-test_index)
  test_set <- dat %>% slice(test_index)
  fit <- lm(y ~ x, data = train_set) #(2)
  y_hat <- predict(fit, newdata = test_set) #(3)
  sqrt(mean((y_hat - test_set$y)^2)) #(4)
})
mean(rmse)
## [1] 2.488661
SD:
sd(rmse)
## [1] 0.1243952
Q2
Now we will repeat the exercise above but using larger datasets. Write a function that takes a size n, then
(1) builds a dataset using the code provided in Q1 but with n observations instead of 100 and without the set.seed(1),
(2) runs the replicate loop that you wrote to answer Q1, which builds 100 linear models and returns a vector of RMSEs, and
(3) calculates the mean and standard deviation.
Set the seed to 1 and then use sapply or map to apply this function to n <- c(100, 500, 1000, 5000, 10000).
Hint: You only need to set the seed once before running your function; do not set a seed within your function. Also be sure to use sapply or map as you will get different answers running the simulations individually due to setting the seed.
#Ans
set.seed(1)
n <- c(100, 500, 1000, 5000, 10000)
res <- sapply(n, function(n){
  Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
  dat <- MASS::mvrnorm(n, c(69, 69), Sigma) %>% #(1)
    data.frame() %>% setNames(c("x", "y"))
  rmse <- replicate(100, { #(2)
    test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
    train_set <- dat %>% slice(-test_index)
    test_set <- dat %>% slice(test_index)
    fit <- lm(y ~ x, data = train_set)
    y_hat <- predict(fit, newdata = test_set)
    sqrt(mean((y_hat - test_set$y)^2))
  })
  c(avg = mean(rmse), sd = sd(rmse)) #(3)
})
res
## [,1] [,2] [,3] [,4] [,5]
## avg 2.4977540 2.72095125 2.55554451 2.62482800 2.61844227
## sd 0.1180821 0.08002108 0.04560258 0.02309673 0.01689205
Mean, 100: 2.498
SD, 100: 0.118
Mean, 500: 2.72
SD, 500: 0.08
Mean, 1000: 2.5555
SD, 1000: 0.0456
Mean, 5000: 2.6248
SD, 5000: 0.0231
Mean, 10000: 2.6184
SD, 10000: 0.0169
Q3
What happens to the RMSE as the size of the dataset becomes larger?
On average, the RMSE does not change much as n gets larger, but the variability of the RMSE decreases.
Because of the law of large numbers the RMSE decreases; more data means more precise estimates.
n = 10000 is not sufficiently large. To see a decrease in the RMSE we would need to make it larger.
The RMSE is not a random variable.
Q4
Now repeat the exercise from Q1, this time making the correlation between x and y larger, as in the following code:
set.seed(1)
n <- 100
Sigma <- 9*matrix(c(1.0, 0.95, 0.95, 1.0), 2, 2)
dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) %>%
data.frame() %>% setNames(c("x", "y"))
Note what happens to RMSE - set the seed to 1 as before.
Mean:
rmse <- replicate(100, {
test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
train_set <- dat %>% slice(-test_index)
test_set <- dat %>% slice(test_index)
fit <- lm(y ~ x, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
})
mean(rmse)
## [1] 0.9167387
SD:
sd(rmse)
## [1] 0.05226268
#graded answer: 0.0624 (the difference likely comes from seeding; the instructions say to set the seed to 1 again before the replicate loop)
Q5
Which of the following best explains why the RMSE in question 4 is so much lower than the RMSE in question 1?
It is just luck. If we do it again, it will be larger.
The central limit theorem tells us that the RMSE is normal.
When we increase the correlation between x and y, x has more predictive power and thus provides a better estimate of y.
These are both examples of regression so the RMSE has to be the same.
Q6
Create a data set using the following code.
set.seed(1)
Sigma <- matrix(c(1.0, 0.75, 0.75, 0.75, 1.0, 0.25, 0.75, 0.25, 1.0), 3, 3)
dat <- MASS::mvrnorm(n = 100, c(0, 0, 0), Sigma) %>%
data.frame() %>% setNames(c("y", "x_1", "x_2"))
Note that y is correlated with both x_1 and x_2 but the two predictors are independent of each other, as seen by cor(dat).
Set the seed to 1, then use the caret package to partition into a test and training set of equal size. Compare the RMSE when using just x_1, just x_2 and both x_1 and x_2. Train a linear model for each.
Which of the three models performs the best (has the lowest RMSE)?
x_1
x_2
x_1 and x_2
#Ans
set.seed(1)
test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
train_set <- dat %>% slice(-test_index)
test_set <- dat %>% slice(test_index)
fit <- lm(y ~ x_1, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
## [1] 0.600666
fit <- lm(y ~ x_2, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
## [1] 0.630699
fit <- lm(y ~ x_1 + x_2, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
## [1] 0.3070962
Q7
Report the lowest RMSE of the three models tested in Q6.
The lowest RMSE is for the model that includes x_1 and x_2 as predictors: 0.3070962.
Q8
Repeat the exercise from Q6 but now create an example in which x_1 and x_2 are highly correlated.
set.seed(1)
Sigma <- matrix(c(1.0, 0.75, 0.75, 0.75, 1.0, 0.95, 0.75, 0.95, 1.0), 3, 3)
dat <- MASS::mvrnorm(n = 100, c(0, 0, 0), Sigma) %>%
data.frame() %>% setNames(c("y", "x_1", "x_2"))
Set the seed to 1, then use the caret package to partition into a test and training set of equal size. Compare the RMSE when using just x_1, just x_2, and both x_1 and x_2.
Compare the results from Q6 and Q8. What can you conclude?
Unless we include all predictors we have no predictive power.
Adding extra predictors improves RMSE regardless of whether the added predictors are correlated with other predictors or not.
Adding extra predictors results in over fitting.
Adding extra predictors can improve RMSE substantially, but not when the added predictors are highly correlated with other predictors.
Explanation
set.seed(1)
test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
train_set <- dat %>% slice(-test_index)
test_set <- dat %>% slice(test_index)
fit <- lm(y ~ x_1, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
## [1] 0.6592608
fit <- lm(y ~ x_2, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
## [1] 0.640081
fit <- lm(y ~ x_1 + x_2, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
## [1] 0.6597865
When the predictors are highly correlated with each other, adding additional predictors does not improve the model substantially, so the RMSE stays roughly the same.
library(dslabs)
data("heights")
y=heights$height
set.seed(2)
test_index=createDataPartition(y,times = 1,p=0.5,list=F)
train_set=heights%>%slice(-test_index)
test_set=heights%>%slice(test_index)
train_set%>%
filter(round(height)==66)%>%
summarize(mean(sex=="Female"))
## mean(sex == "Female")
## 1 0.2424242
lm_fit=mutate(train_set, y=as.numeric(sex=="Female"))%>%
lm(y~height, data=.)
p_hat=predict(lm_fit,test_set)
y_hat=ifelse(p_hat>0.5,"Female","Male")%>%factor()
confusionMatrix(y_hat, test_set$sex)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Female Male
## Female 20 15
## Male 98 393
##
## Accuracy : 0.7852
## 95% CI : (0.7476, 0.8195)
## No Information Rate : 0.7757
## P-Value [Acc > NIR] : 0.3218
##
## Kappa : 0.177
##
## Mcnemar's Test P-Value : 1.22e-14
##
## Sensitivity : 0.16949
## Specificity : 0.96324
## Pos Pred Value : 0.57143
## Neg Pred Value : 0.80041
## Prevalence : 0.22433
## Detection Rate : 0.03802
## Detection Prevalence : 0.06654
## Balanced Accuracy : 0.56636
##
## 'Positive' Class : Female
##
glm_fit=train_set%>%
mutate(y=as.numeric(sex=="Female"))%>%
glm(y~height,data=.,family = "binomial")
p_hat_logit=predict(glm_fit,newdata = test_set,type="response")
y_hat_logit=ifelse(p_hat_logit>0.5,"Female","Male")%>%factor
confusionMatrix(y_hat_logit,test_set$sex)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Female Male
## Female 31 19
## Male 87 389
##
## Accuracy : 0.7985
## 95% CI : (0.7616, 0.832)
## No Information Rate : 0.7757
## P-Value [Acc > NIR] : 0.1138
##
## Kappa : 0.2718
##
## Mcnemar's Test P-Value : 7.635e-11
##
## Sensitivity : 0.26271
## Specificity : 0.95343
## Pos Pred Value : 0.62000
## Neg Pred Value : 0.81723
## Prevalence : 0.22433
## Detection Rate : 0.05894
## Detection Prevalence : 0.09506
## Balanced Accuracy : 0.60807
##
## 'Positive' Class : Female
##
data("mnist_27")
fit=glm(y~ x_1 + x_2,data=mnist_27$train, family = "binomial")
p_hat=predict(fit,newdata = mnist_27$test) #note: predict.glm returns the link (log-odds) scale by default
y_hat=factor(ifelse(p_hat>0.5,7,2))
confusionMatrix(data=y_hat,reference = mnist_27$test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2 7
## 2 92 34
## 7 14 60
##
## Accuracy : 0.76
## 95% CI : (0.6947, 0.8174)
## No Information Rate : 0.53
## P-Value [Acc > NIR] : 1.668e-11
##
## Kappa : 0.5124
##
## Mcnemar's Test P-Value : 0.006099
##
## Sensitivity : 0.8679
## Specificity : 0.6383
## Pos Pred Value : 0.7302
## Neg Pred Value : 0.8108
## Prevalence : 0.5300
## Detection Rate : 0.4600
## Detection Prevalence : 0.6300
## Balanced Accuracy : 0.7531
##
## 'Positive' Class : 2
##
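A caveat worth flagging: predict.glm() returns predictions on the link (log-odds) scale by default, so a 0.5 cutoff applied to it is a cutoff on log-odds, not on probabilities. A minimal sketch repeating the fit above with predictions requested on the response (probability) scale:

```r
library(dslabs)
data("mnist_27")
fit <- glm(y ~ x_1 + x_2, data = mnist_27$train, family = "binomial")
# type = "response" returns fitted probabilities rather than log-odds,
# so the 0.5 cutoff below is on the probability scale
p_hat <- predict(fit, newdata = mnist_27$test, type = "response")
y_hat <- factor(ifelse(p_hat > 0.5, 7, 2))
```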
mnist_27$true_p %>% ggplot(aes(x_1,x_2,fill=p)) +
geom_raster()
mnist_27$true_p %>% ggplot(aes(x_1,x_2,z=p,fill=p)) +
geom_raster() +
scale_fill_gradientn(colors=c("#F8766D","white","#00BFC4")) +
stat_contour(breaks=c(0.5),color="black")
Q1
Define a dataset using the following code:
set.seed(2)
make_data <- function(n = 1000, p = 0.5,
mu_0 = 0, mu_1 = 2,
sigma_0 = 1, sigma_1 = 1){
y <- rbinom(n, 1, p)
f_0 <- rnorm(n, mu_0, sigma_0)
f_1 <- rnorm(n, mu_1, sigma_1)
x <- ifelse(y == 1, f_1, f_0)
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
list(train = data.frame(x = x, y = as.factor(y)) %>% slice(-test_index),
test = data.frame(x = x, y = as.factor(y)) %>% slice(test_index))
}
dat <- make_data()
Note that we have defined a variable x that is predictive of a binary outcome y:
dat$train %>% ggplot(aes(x, color = y)) + geom_density()
Set the seed to 1, then use the make_data function defined above to generate 25 different datasets with mu_1 <- seq(0, 3, len = 25). Perform logistic regression on each of the 25 different datasets (predict 1 if p > 0.5) and plot accuracy (res in the figures) vs mu_1 (delta in the figures).
Which is the correct plot?
Explanation
set.seed(1)
delta <- seq(0, 3, len = 25)
res <- sapply(delta, function(d){
dat <- make_data(mu_1 = d)
fit_glm <- dat$train %>% glm(y ~ x, family = "binomial", data = .)
y_hat_glm <- ifelse(predict(fit_glm, dat$test) > 0.5, 1, 0) %>% factor(levels = c(0, 1))
mean(y_hat_glm == dat$test$y)
})
qplot(delta, res)
data("polls_2008")
qplot(day,margin,data=polls_2008)
In the poll example, for each day, we would compute the average for the values within a week of the day that we’re considering.
The black points are the points that are used to compute the average at those two points.
The blue line represents the average that was computed.
span=7
fit=with(polls_2008,
ksmooth(day,margin,x.points = day,kernel = "box",bandwidth = span))
polls_2008 %>% mutate(smooth=fit$y) %>%
ggplot(aes(day,margin)) +
geom_point(size=3, alpha=0.5,color="grey") +
geom_line(aes(day,smooth),color="red")
total_days=diff(range(polls_2008$day))
span=21/total_days
fit=loess(margin~day,degree=1,span=span,data=polls_2008)
polls_2008 %>% mutate(smooth=fit$fitted) %>%
ggplot(aes(day,margin)) +
geom_point(size=3, alpha=0.5,color="grey") +
geom_line(aes(day,smooth),color="red")
There are three other differences between loess and the typical bin smoother.
The first is that rather than keeping the bin size the same, loess keeps the number of points used in the local fit the same. This number is controlled via the span argument which expects a proportion.
So for example, if N is a number of data points, and the span is 0.5, then for any given X, loess will use 0.5 * N closest points to X for the fit.
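A sketch with the polls_2008 data (the span value 0.5 is just for illustration):

```r
library(dslabs)
data("polls_2008")
N <- nrow(polls_2008)
span <- 0.5
# with span = 0.5, loess uses roughly the floor(span * N) closest points to each x
floor(span * N)
```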
Another difference is that when fitting a line locally, loess uses a weighted approach. Basically, instead of least squares, we minimize a weighted version:

sum_{i=1}^{N} w_0(x_i) * [Y_i - (β0 + β1(x_i - x_0))]^2

However, instead of the Gaussian kernel, loess uses a function called the Tukey tri-weight:

W(u) = (1 - |u|^3)^3 if |u| ≤ 1, and W(u) = 0 if |u| > 1,

which is used to define the weights:

w_0(x_i) = W((x_i - x_0) / h)

The tri-weight kernel looks like a smooth bell curve that reaches exactly zero at |u| = 1, so points outside the window receive no weight at all.
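The tri-weight can be sketched and plotted directly (the function name triweight is ours):

```r
# Tukey tri-weight: W(u) = (1 - |u|^3)^3 for |u| <= 1, and 0 otherwise
triweight <- function(u) ifelse(abs(u) <= 1, (1 - abs(u)^3)^3, 0)
u <- seq(-1.5, 1.5, length.out = 301)
plot(u, triweight(u), type = "l")  # smooth bell shape, exactly 0 beyond |u| = 1
```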
The third difference is that loess has the option of fitting the local model robustly. An iterative algorithm is implemented in which, after fitting a model in one iteration, outliers are detected and down-weighted for the next iteration.
To use this option, use the argument family = "symmetric".
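A sketch of the robust option with the polls data (same span as the local-line fit above; fit_robust is our name):

```r
library(dslabs)
data("polls_2008")
total_days <- diff(range(polls_2008$day))
# family = "symmetric" turns on the iterative re-fit that down-weights outliers
fit_robust <- loess(margin ~ day, degree = 1, span = 21 / total_days,
                    family = "symmetric", data = polls_2008)
```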
Taylor’s theorem also tells us that if you look at a function close enough, it looks like a parabola and that you don’t have to look as close as you do for the linear approximation.
This means we can make our windows even larger and fit parabolas instead of lines, so the local model would look like this:

E[Y_i | X_i = x_i] = β0 + β1(x_i - x_0) + β2(x_i - x_0)^2
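In code, fitting local parabolas just means degree = 2 (which is in fact the loess default); a sketch with a wider window, since parabolas tolerate larger spans (the span of 28 days and the name fit_2 are ours):

```r
library(dslabs)
data("polls_2008")
total_days <- diff(range(polls_2008$day))
# degree = 2 fits a local parabola at each point instead of a local line
fit_2 <- loess(margin ~ day, degree = 2, span = 28 / total_days, data = polls_2008)
```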
polls_2008 %>% ggplot(aes(day,margin)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
polls_2008 %>% ggplot(aes(day,margin)) +
geom_point() +
geom_smooth(color="red",span=0.15,
method.args=list(degree=1))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Q1
In the Wrangling course of this series, PH125.6x, we used the following code to obtain mortality counts for Puerto Rico for 2015-2018:
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.0.1 ✔ purrr 0.3.0
## ✔ tidyr 0.8.2 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
library(purrr)
library(pdftools)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
fn <- system.file("extdata", "RD-Mortality-Report_2015-18-180531.pdf", package="dslabs")
dat <- map_df(str_split(pdf_text(fn), "\n"), function(s){
s <- str_trim(s)
header_index <- str_which(s, "2015")[1]
tmp <- str_split(s[header_index], "\\s+", simplify = TRUE)
month <- tmp[1]
header <- tmp[-1]
tail_index <- str_which(s, "Total")
n <- str_count(s, "\\d+")
out <- c(1:header_index, which(n==1), which(n>=28), tail_index:length(s))
s[-out] %>%
str_remove_all("[^\\d\\s]") %>%
str_trim() %>%
str_split_fixed("\\s+", n = 6) %>%
.[,1:5] %>%
as_data_frame() %>%
setNames(c("day", header)) %>%
mutate(month = month,
day = as.numeric(day)) %>%
gather(year, deaths, -c(day, month)) %>%
mutate(deaths = as.numeric(deaths))
}) %>%
mutate(month = recode(month, "JAN" = 1, "FEB" = 2, "MAR" = 3, "APR" = 4, "MAY" = 5, "JUN" = 6,
"JUL" = 7, "AGO" = 8, "SEP" = 9, "OCT" = 10, "NOV" = 11, "DEC" = 12)) %>%
mutate(date = make_date(year, month, day)) %>%
filter(date <= "2018-05-01")
## Warning: `as_data_frame()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
Use the loess function to obtain a smooth estimate of the expected number of deaths as a function of date. Plot this resulting smooth function. Make the span about two months long.
Which of the following plots is correct?
Explanation
span <- 60 / as.numeric(diff(range(dat$date)))
fit <- dat %>% mutate(x = as.numeric(date)) %>% loess(deaths ~ x, data = ., span = span, degree = 1)
dat %>% mutate(smooth = predict(fit, as.numeric(date))) %>%
ggplot() +
geom_point(aes(date, deaths)) +
geom_line(aes(date, smooth), lwd = 2, col = 2)
## Warning: Removed 1 rows containing missing values (geom_point).
The second plot uses a shorter span, the third plot uses the entire timespan, and the fourth plot uses a longer span.
Q2
Work with the same data as in Q1 to plot smooth estimates against day of the year, all on the same plot, but with different colors for each year.
Which code produces the desired plot?
dat %>%
mutate(smooth = predict(fit), day = yday(date), year = as.character(year(date))) %>%
ggplot(aes(day, smooth, col = year)) +
geom_line(lwd = 2)
dat %>%
mutate(smooth = predict(fit, as.numeric(date)), day = mday(date), year = as.character(year(date))) %>%
ggplot(aes(day, smooth, col = year)) +
geom_line(lwd = 2)
dat %>%
mutate(smooth = predict(fit, as.numeric(date)), day = yday(date), year = as.character(year(date))) %>%
ggplot(aes(day, smooth)) +
geom_line(lwd = 2)
dat %>% mutate(smooth = predict(fit, as.numeric(date)), day = yday(date), year = as.character(year(date))) %>% ggplot(aes(day, smooth, col = year)) + geom_line(lwd = 2)
Q3
Suppose we want to predict 2s and 7s in the mnist_27 dataset with just the second covariate. Can we do this? On first inspection it appears the data does not have much predictive power.
In fact, if we fit a regular logistic regression the coefficient for x_2 is not significant!
This can be seen using this code:
library(broom)
mnist_27$train %>% glm(y ~ x_2, family = "binomial", data = .) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.0907 0.247 -0.368 0.713
## 2 x_2 0.685 0.827 0.829 0.407
Plotting a scatterplot here is not useful since y is binary:
qplot(x_2, y, data = mnist_27$train)
Fit a loess line to the data above and plot the results. What do you observe?
There is no predictive power and the conditional probability is linear.
There is no predictive power and the conditional probability is non-linear.
There is predictive power and the conditional probability is linear.
There is predictive power and the conditional probability is non-linear.
#Ans
mnist_27$train %>%
mutate(y = ifelse(y=="7", 1, 0)) %>%
ggplot(aes(x_2, y)) +
geom_smooth(method = "loess")
mnist=read_mnist()
class(mnist$train$images)
## [1] "matrix"
x=mnist$train$images[1:1000,]
y=mnist$train$labels[1:1000]
5 challenges:
1. study the distribution of the total pixel darkness and how it varies by digits.
2. study the variation of each pixel and remove predictors, columns, associated with pixels that don’t change much and thus can’t provide much information for classification.
3. zero out low values that are likely smudges. First, we’re going to look at the distribution of all pixel values, use this to pick a cutoff to define unwritten space, then make anything below that cutoff a 0.
4. binarize the data. We’re going to first look at the distribution of all pixel values, use this to pick a cutoff, and distinguish between writing and no writing. Then convert all entries into either zero or one respectively.
5. scale each of the predictors in each entry to have the same average and standard deviation.
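The first challenge can be sketched as follows. We use simulated stand-ins for the pixel matrix and labels so the snippet is self-contained; with the real data, replace x_sim and y_sim by the x and y defined above from read_mnist():

```r
# simulated stand-ins for x (pixel matrix) and y (digit labels)
set.seed(1)
x_sim <- matrix(sample(0:255, 100 * 784, replace = TRUE), 100, 784)
y_sim <- sample(0:9, 100, replace = TRUE)
total_darkness <- rowMeans(x_sim)  # challenge 1: average pixel intensity per image
boxplot(total_darkness ~ y_sim)    # compare the distribution across digit classes
```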
In matrix algebra we have three main types of objects: scalars, vectors, and matrices.
length(x[1,])
## [1] 784
dim(as.matrix(x))
## [1] 1000 784
my_vector=1:15
mat=matrix(my_vector,5,3)
mat
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 2 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
mat_t=matrix(my_vector,3,5,byrow = T)
mat_t
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
identical(t(mat),mat_t)
## [1] TRUE
matrix(my_vector,5,5) #when rows*columns (25) exceeds the vector length (15), the vector is recycled
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 1 6
## [2,] 2 7 12 2 7
## [3,] 3 8 13 3 8
## [4,] 4 9 14 4 9
## [5,] 5 10 15 5 10
grid=matrix(x[3,],28,28)
image(1:28,1:28,grid)
image(1:28,1:28,grid[,28:1])
sums=rowSums(x)
avg=rowMeans(x)
avgs=apply(x,1,mean)
sds=apply(x,2,sd)
library(matrixStats)
##
## Attaching package: 'matrixStats'
## The following object is masked from 'package:dplyr':
##
## count
sds=colSds(x)
#x[,c(351,352)]
#x[c(2,3),]
new_x=x[,colSds(x)>60]
dim(new_x)
## [1] 1000 314
class(x[,1])
## [1] "integer"
dim(x[1,])
## NULL
class(x[,1,drop=F])
## [1] "matrix"
dim(x[ ,1,drop=F])
## [1] 1000 1
mat=matrix(1:15,5,3)
mat
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 2 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
as.vector(mat)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
qplot(as.vector(x),bins=30,color=I("black"))
new_x=x
new_x[new_x<50]=0
mat=matrix(1:15,5,3)
mat[mat<3]=0
mat
## [,1] [,2] [,3]
## [1,] 0 6 11
## [2,] 0 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
mat=matrix(1:15,5,3)
mat[mat>6 & mat<12]=0
mat
## [,1] [,2] [,3]
## [1,] 1 6 0
## [2,] 2 0 12
## [3,] 3 0 13
## [4,] 4 0 14
## [5,] 5 0 15
bin_x=x
bin_x[bin_x < 255/2]=0
bin_x[bin_x > 255/2]=1
bin_x=(x>255/2)*1
Subtracting a vector from a matrix happens row-wise: R recycles the vector down the columns, so we can center and scale each row directly.
(x - rowMeans(x)) / rowSds(x)
t(t(x) - colMeans(x)) #for columns: transpose, subtract, then transpose back
x_mean_0=sweep(x,2,colMeans(x))
x_standardized=sweep(x_mean_0,2,colSds(x),FUN="/")
t(x) %*% x
crossprod(x)
solve(crossprod(x)) #to compute the inverse of a matrix
qr(x) #qr decomposition
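A quick sketch checking these identities on a small matrix (the names a and inv are ours):

```r
set.seed(1)
a <- matrix(rnorm(12), 4, 3)
# crossprod(a) computes t(a) %*% a without forming the transpose explicitly
stopifnot(all.equal(crossprod(a), t(a) %*% a))
# solve() inverts the (square) cross-product; multiplying back gives the identity
inv <- solve(crossprod(a))
stopifnot(all.equal(inv %*% crossprod(a), diag(3)))
```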
Q1
Which line of code correctly creates a 100 by 10 matrix of randomly generated normal numbers and assigns it to x?
x <- matrix(rnorm(1000), 100, 100)
x <- matrix(rnorm(100*10), 100, 10)
x <- matrix(rnorm(100*10), 10, 10)
x <- matrix(rnorm(100*10), 10, 100)
Q2
Write the line of code that would give you the specified information about the matrix x that you generated in q1. Do not include any spaces in your line of code.
Dimension of x.
x <- matrix(rnorm(100*10), 100, 10)
dim(x)
## [1] 100 10
Number of rows of x.
nrow(x)
## [1] 100
dim(x)[1]
## [1] 100
Number of columns of x.
ncol(x)
## [1] 10
dim(x)[2]
## [1] 10
Q3
Which of the following lines of code would add the scalar 1 to row 1, the scalar 2 to row 2, and so on, for the matrix x?
x <- x + seq(nrow(x))
x <- 1:nrow(x)
x <- sweep(x, 2, 1:nrow(x),"+")
x <- sweep(x, 1, 1:nrow(x),"+")
Q4
Which of the following lines of code would add the scalar 1 to column 1, the scalar 2 to column 2, and so on, for the matrix x?
x <- 1:ncol(x)
x <- 1:col(x)
x <- sweep(x, 2, 1:ncol(x), FUN = "+")
x <- -x
Q5
Which code correctly computes the average of each row of x?
mean(x)
rowMedians(x)
sapply(x,mean)
rowSums(x)
rowMeans(x)
Which code correctly computes the average of each column of x?
mean(x)
sapply(x,mean)
colMeans(x)
colMedians(x)
colSums(x)
Q6
For each digit in the mnist training data, compute the proportion of pixels that are in the grey area, defined as values between 50 and 205. (To visualize this, you can make a boxplot by digit class.)
What proportion of pixels are in the grey area overall, defined as values between 50 and 205?
#Ans
#mnist <- read_mnist()
y <- rowMeans(mnist$train$images>50 & mnist$train$images<205)
mean(y)
## [1] 0.06183703
qplot(as.factor(mnist$train$labels), y, geom = "boxplot")